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Abstract 

Two proteins, one belonging to the mainly α class and the other belonging to the α/β class,
are selected to test a kinetic mechanism for protein folding. Targeted molecular dynamics is
applied to generate folding pathways for those two proteins, starting from two well-defined initial
conformations: a fully extended and an α-helical conformation. The results show that for both
proteins the α-helical initial conformation provides overall lower energy pathways to the native
state. For the α/β protein, 30 % (40 %) of the pathways from an initial α-helix (fully extended)
structure lead to unentangled native folds, a success rate that can be increased to 85 % by the
introduction of a well-defined intermediate structure. These results open up a new direction in
which to look for a solution to the protein folding problem, as detailed at the end.
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INTRODUCTION. 



Proteins are mega molecules. Even a small protein, with 60 amino acids, possesses
approximately one thousand atoms which are linked by covalent bonds, hydrogen bonds,
electrostatic interactions and van der Waals interactions. Despite the inherent complexity
of such a system, proteins, in cells, are capable of folding reproducibly to a well-defined
three-dimensional structure that is generally irregular and lacks symmetry. How do proteins
do it? According to the thermodynamic hypothesis [1], the native structure of proteins corresponds
to the minimum of the free energy and protein folding is a thermodynamic process simply
driven by thermal agitation. The complementary proposal that the free energy landscape is
funnel-shaped was later put forward to explain also why folding can proceed comparatively
fast [2-4]. However, in spite of many decades of study, and of the improvements in the
accuracy of force fields [5-7] and in the power of computer facilities [8], [9], it has not yet
been possible to determine a protein structure solely from its amino acid sequence and the
thermodynamic and funnel hypotheses remain to be proven.

Here another possibility is considered, namely, that protein folding is a kinetic process
and that the native structure is just one of the many kinetic traps in which proteins can
find themselves [10]. In previous works [11-14] it has been shown that a protein from
a given CATH [15] class can be forced into artificial, non-native structures that belong to
other CATH classes and maintain these structures for at least 50 nanoseconds. These results
favour instead a multi-funnel-shaped free energy landscape of proteins, and a specific kinetic
process for in vivo protein folding was suggested [14]. For a kinetic process to lead to a
well-defined three-dimensional structure every time, it must be associated with a specific
pathway, as Levinthal first proposed [16]; however, and equally importantly, it must always
start from the same well-defined initial conformation. In [14] it was suggested that this
specific conformation, that is, the conformation that all proteins have immediately after
synthesis, is helical and that the first step in folding is the bending of this initial helix at
specific amino acid sites. The purpose of the present study is to make a preliminary test of
the efficiency of such a kinetic process for folding. To that end, two proteins, representative
of the mainly α and α/β CATH [15] protein classes, were selected and pathways from the
initial conformation to the native state were generated using Targeted Molecular Dynamics
(TMD) [13]. In TMD simulations, harmonic restraints are added to the protein force field in
order to drive an initial protein conformation to a given target conformation [17], [18]. TMD
simulations were first applied to the T ↔ R transition of the protein insulin [17] and have
since been used to study a variety of other problems, including, e.g., the conformational
changes associated with the functioning of a molecular motor [19], the elucidation of the
reaction steps in the full catalytic cycle of a protein involved in an electron transfer process
[20], as well as protein folding [21]. Thus, TMD simulations have generally shown their
usefulness for modelling processes that involve large conformational changes.

Experimentally, it has not yet been possible to determine the structure of nascent chains,
that is, the structure proteins have as they come out of the ribosomal tunnel, but it is known
that both fully extended and α-helical structures fit into the tunnel dimensions [22]. In the
present study, TMD simulations are used to compare the efficiency of folding from an initial
helical conformation, proposed previously to be the conformation of proteins immediately
after synthesis [14], to the efficiency of folding from a fully extended conformation.



Two proteins, each representative of one of the four main protein CATH classes [15],
were chosen and their native structures were obtained from the Protein Data Bank (PDB)
[23]. One of these proteins (PDB 1BDD [24], 60 amino acids, 941 atoms, mainly α) has a
native structure constituted by three α-helices, while the second protein (PDB 1IGD [25], 61
amino acids, 927 atoms, α/β) has a native structure that includes one α-helix and a β-sheet.
Two initial conformations, both of which have been identified as viable experimentally [22],
were taken for each of the two proteins, namely, one conformation in which the backbone
is fully extended and a second conformation in which the backbone is folded into an ideal
α-helix. The energy-minimized versions of these two conformations were used as initial
conditions for the TMD simulations. The ff99SB force field [26], with an implicit solvent,
implemented in AMBER 9 [27], which has been shown to give a better representation of
secondary structure [28], was used in all simulations. For each initial conformation, 20
independent TMD trajectories to the native state were generated by changing the seeds of
the random forces in the Langevin thermal baths, with T = 298 K in all cases. In TMD
simulations the harmonic forces that drive the initial conformations to the native structure
arise from the following term added to the atomic potential energy function:



$$
E_{\mathrm{TMD}} = \frac{1}{2}\, k\, N \left[ \mathrm{RMSD}_N(t) - \mathrm{RMSD}_{\mathrm{target}}(t) \right]^2 \qquad (1)
$$

where N is the number of atoms used to calculate the root mean square deviation per
atom (RMSD), RMSD_N(t) being the RMSD, with respect to the native structure, of the
conformation the protein has at time t, and RMSD_target(t) being the target RMSD at that
time. In the simulations reported here, k = 100 kcal/mol/Å² and N is the total number of
carbons, nitrogens and oxygens in the backbone of the two proteins selected.

TABLE I. RMSD_target(t = 0), in Å, for the two initial conformations of the two proteins.

PROTEIN            INITIALLY EXTENDED    INITIALLY α-HELIX
mainly α (1BDD)    55.78                 21.54
α/β (1IGD)         59.76                 24.46
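As an illustration of how the restraint term (1) behaves, the following minimal Python sketch evaluates it for a given deviation from the target. It is not the AMBER implementation: the function names `rmsd` and `tmd_energy` are hypothetical, the RMSD is computed without any optimal superposition (the two structures are assumed to be already aligned), and the value N = 240 used in the usage note is only an assumed count of four backbone heavy atoms for each of 60 residues.

```python
import math


def rmsd(coords, reference):
    """RMSD per atom between two aligned lists of (x, y, z) coordinates, in Å.

    Assumption: no best-fit superposition is performed here.
    """
    if len(coords) != len(reference) or not coords:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for point, ref_point in zip(coords, reference):
        total += sum((a - b) ** 2 for a, b in zip(point, ref_point))
    return math.sqrt(total / len(coords))


def tmd_energy(rmsd_now, rmsd_target, n_atoms, k=100.0):
    """Restraint energy of equation (1), in kcal/mol.

    k is in kcal/mol/Å^2 (100 in the text); n_atoms is N in equation (1).
    """
    return 0.5 * k * n_atoms * (rmsd_now - rmsd_target) ** 2


if __name__ == "__main__":
    # Hypothetical example: N = 240 and a lag of 0.1 Å behind the target.
    print(tmd_energy(rmsd_now=1.1, rmsd_target=1.0, n_atoms=240))
```

With these assumed numbers, a lag of only 0.1 Å behind the target already costs about 120 kcal/mol, which shows why the instantaneous RMSD follows the target closely when k is this large.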

In TMD simulations, RMSD_target(t) is a linear function of time, being completely defined
by its values at two time instants. Here, these two values are 1) its initial value,
RMSD_target(t = 0), given in table I and 2) its final value, set to 0.1 Å in all trajectories.
As shown in table I, for both proteins the α-helical conformation is more than 2.4 times
closer to the native structure than the fully extended conformation. Preliminary simulations
starting from the α-helical conformation showed that the total potential energy (including
the TMD term) does not vary much until RMSD_N is approximately 6 Å and so, in order to make the
simulations with the two different initial conformations as equivalent as possible, in all the
TMD simulations presented here, two RMSD_target(t) functions were used, i.e., the first 0.1
ns were spent in the convergence of the initial conformation to within 6 Å of the native
structure, that is, RMSD_target(t = 0.1 ns) = 6 Å, and 0.4 ns were allowed for the further
convergence to within 0.1 Å of the native conformation. In this way, a slower rate of change
of the RMSD is imposed in the final stages of convergence to the native state, allowing more
time for the side chains to avoid steric overlaps, and, most importantly, making the rate of
RMSD change in this latter stage the same for all simulations. The overall duration of each
TMD simulation was 0.5 ns, comparable to other TMD protein folding simulations [21].
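The two-stage target schedule described above can be written down explicitly. The sketch below is only an illustration of that piecewise-linear function, using the values quoted in the text (6 Å at 0.1 ns, 0.1 Å at 0.5 ns) and the initial RMSD values of table I; the function names `rmsd_target` and `stage_rates` are hypothetical and do not correspond to any AMBER input.

```python
def rmsd_target(t_ns, rmsd_initial, rmsd_mid=6.0, t_mid=0.1,
                rmsd_final=0.1, t_end=0.5):
    """Piecewise-linear target RMSD (Å) at time t_ns (ns).

    Stage 1: rmsd_initial -> rmsd_mid over [0, t_mid].
    Stage 2: rmsd_mid -> rmsd_final over [t_mid, t_end].
    """
    if not 0.0 <= t_ns <= t_end:
        raise ValueError("t_ns must lie within the simulation window")
    if t_ns <= t_mid:
        fraction = t_ns / t_mid
        return rmsd_initial + fraction * (rmsd_mid - rmsd_initial)
    fraction = (t_ns - t_mid) / (t_end - t_mid)
    return rmsd_mid + fraction * (rmsd_final - rmsd_mid)


def stage_rates(rmsd_initial, rmsd_mid=6.0, t_mid=0.1,
                rmsd_final=0.1, t_end=0.5):
    """Rates of RMSD decrease (Å/ns) in stage 1 and stage 2."""
    rate_1 = (rmsd_initial - rmsd_mid) / t_mid
    rate_2 = (rmsd_mid - rmsd_final) / (t_end - t_mid)
    return rate_1, rate_2


if __name__ == "__main__":
    # Initial values for the 1IGD protein, taken from table I.
    for label, start in (("extended", 59.76), ("alpha-helix", 24.46)):
        print(label, stage_rates(start), rmsd_target(0.3, start))
```

The second-stage rate, (6 - 0.1) Å / 0.4 ns = 14.75 Å/ns, does not depend on the initial conformation, which is the equivalence between the two sets of simulations that the schedule is designed to give.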



Each of the two initial conformations was taken as an initial condition in 20 independent
TMD simulations, as detailed above. Figure 1 shows the average over the 20 trajectories
and the corresponding standard deviations of the total potential energy as a function of
the instantaneous value of RMSD_N for the mainly α 1BDD protein [24]. The curve in red is
for the trajectories starting with an α-helical conformation and the curve in green is for the
trajectories starting with a fully extended backbone conformation. Included in the values
displayed in figure 1 is the contribution of the TMD term (1) which, however, only starts to
rise above 0.3 % of the total value when the RMSD distance to the native structure becomes
less than 0.6 Å (not shown). This means that the energetics of these folding pathways is
determined essentially by the atomic interactions in the ff99SB AMBER potential [26].
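Averages and standard deviations of the energy as a function of RMSD, of the kind described above, can be obtained by binning the samples of all trajectories along the RMSD axis. The following is a minimal sketch of one such procedure, not the analysis actually used for the figures: the function name `average_energy_by_rmsd`, the bin width and the toy data are all assumptions, and samples from all trajectories falling in a bin are simply pooled.

```python
from collections import defaultdict
from statistics import mean, stdev


def average_energy_by_rmsd(trajectories, bin_width=1.0):
    """Pool (rmsd, energy) samples from all trajectories into RMSD bins.

    Returns {bin_centre: (mean_energy, standard_deviation)}; the standard
    deviation is set to 0.0 for bins holding a single sample.
    """
    bins = defaultdict(list)
    for trajectory in trajectories:
        for rmsd_value, energy in trajectory:
            bins[int(rmsd_value // bin_width)].append(energy)
    result = {}
    for index in sorted(bins):
        energies = bins[index]
        centre = (index + 0.5) * bin_width
        spread = stdev(energies) if len(energies) > 1 else 0.0
        result[centre] = (mean(energies), spread)
    return result


if __name__ == "__main__":
    # Toy data (hypothetical): two short trajectories of (RMSD in Å, energy in kcal/mol).
    toy = [
        [(5.2, -100.0), (0.4, -200.0)],
        [(5.7, -120.0), (0.1, -220.0)],
    ]
    for centre, (avg, spread) in average_energy_by_rmsd(toy).items():
        print(centre, avg, spread)
```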

Figure 1 shows that the pathways from the initially extended conformations are populated
by protein conformations which have, on average, a potential energy that is much greater
than those that arise when the initial conformation is α-helical; indeed, even in the final
stages of approach to the native state, the former protein conformations are at least 40
kcal/mol above the latter. Furthermore, contrary to what was found in previous folding
simulations and to what is found below for the 1IGD protein, all pathways of the mainly α 1BDD
protein lead to conformations with a perfectly folded backbone at least within 0.3 Å of the
native structure.

To test further the hypotheses put forward in [14], another protein, the 1IGD protein
[25], whose native structure includes both an α-helix and a four-stranded β-sheet, was used.
As reported previously [21], many of the TMD folding simulations of the 1IGD protein
lead to final structures that, although apparently close to the native structure, differ from it
by entangled backbone folds such as those highlighted by the red circles in figure 2. Indeed,
inspection of the trajectories with the Visual Molecular Dynamics (VMD) software [30]
reveals that 12 (14) of the 20 TMD simulations starting with a fully extended (α-helical)
conformation lead to such entangled final structures. These entanglements can be resolved
by letting the chains go through each other, an artificial process enabled by the TMD term
(1) but of course penalized by extremely large values of the potential energy. Thus, in figure
3 the averages are made over viable pathways only, that is, over the TMD trajectories in
which such backbone entanglements did not occur. Although the TMD term contributes
a little more than before to the energetics of the pathways, its total amount only rises
above 1 % when the RMSD distance to the native structure goes below 1 Å, with the larger
values being associated with entangled structures (not shown). Figure 3 shows that, also
for the 1IGD protein, the viable pathways from an initial α-helix to the native state
are much less energetic than the corresponding ones from a fully extended structure, with
the conformations covered by the latter having a potential energy more than 130 kcal/mol
greater than the former to start with, and with their values only merging when the RMSD
distance to the native structure is approximately 6 Å.



In [14] it is suggested that the initial conformation of a protein is a helix and that the
first step in protein folding is the bending of this helix at specific amino acid sites. In the
case of proteins whose secondary structure is just a set of helix bundles, as in the mainly
α 1BDD protein, this first step is the only major step in their folding pathways. VMD
[30] animations of the TMD trajectories of the 1BDD protein do show pathways of this sort
when the initial conformation is α-helical (not shown). (When the initial conformation is
fully extended, on the other hand, the formation of the α-helices constitutes the last step,
coming after the backbone has folded to the native topology.) For proteins whose native
structure includes both helices and sheets, as is the case of the α/β 1IGD protein, it was
further suggested [14] that the formation of the β-sheets is due to destabilizing interactions
between the amino acid side chains that are thus thrown together by the first step. VMD
[30] animations of the TMD trajectories of the α/β 1IGD protein starting from the α-helical
conformation reveal pathways that start off in that manner but in which the helical portions
very quickly become distorted because the convergence to the right secondary structure is
mixed with the convergence to the right backbone fold (not shown). This is to be expected
because of the unspecific character of the TMD term (1). In order to test, in a more direct
way, the effect that an intermediate all-helical conformation has on the folding efficiency of
the α/β 1IGD protein, such a putative intermediate was generated by substituting the
β-sheets in the native structure of 1IGD by α-helices, as shown in figure 4. This intermediate
shall here be called an embrio because, according to the kinetic process proposed in [14], its
structure is one of the main determinants of the final, native, structure of the protein.

Further TMD simulations were run by first driving an initial α-helix to the embrio, and
then by driving the embrionic conformations to the native structure of 1IGD. As the RMSD
between the α-helix and the embrio is 23.78 Å and the RMSD between the embrio and the
native 1IGD structure is 15.05 Å, in order to keep approximately the same rate of change of
RMSD as before, the first simulations had a duration of 0.05 ns and the second simulations
had a duration of 0.45 ns. In the first 0.05 ns all 20 trajectories converge to within 0.1 Å of the
embrio structure; furthermore, 17 out of the 20 trajectories from the embrio to the native
state lead to unentangled final structures (not shown). Figure 3, in which the potential
energies of the protein conformations in these 17 viable folding pathways are displayed in
blue, shows that the α-helix → embrio → native trajectories provide, on average, the lowest
energy folding pathways for the α/β 1IGD protein.
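The success rates for the 1IGD protein follow directly from the trajectory counts reported above: 12 and 14 entangled outcomes out of 20 for the fully extended and α-helical starting points, and 17 unentangled outcomes out of 20 (that is, 3 entangled) when folding proceeds through the embrio. The short Python sketch below only restates this arithmetic; the function name `success_rate` and the dictionary labels are illustrative.

```python
def success_rate(n_total, n_entangled):
    """Percentage of trajectories that reach an unentangled native fold."""
    if n_total <= 0 or not 0 <= n_entangled <= n_total:
        raise ValueError("counts must satisfy 0 <= n_entangled <= n_total, n_total > 0")
    return 100.0 * (n_total - n_entangled) / n_total


# Counts quoted in the text for the 20 TMD trajectories of each kind.
rates = {
    "extended -> native": success_rate(20, 12),
    "alpha-helix -> native": success_rate(20, 14),
    "alpha-helix -> embrio -> native": success_rate(20, 3),
}

if __name__ == "__main__":
    for pathway, rate in rates.items():
        print(f"{pathway}: {rate:.0f} %")
```

These are the 40 %, 30 % and 85 % figures discussed in the text.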



This study does not prove the hypothesis that the α-helix is the conformation that all
proteins have as they come out of the ribosomal tunnel; it does not prove either the hypothesis
that the first step in the folding of all proteins is the formation of a compact core, here
called an embrio, constituted only by helices (and disordered helices or turns), nor does it
prove that β-sheets arise when one or more of the helices in that embrio is not stable; but
it does show that those hypotheses are viable in that they can lead to low energy folding
pathways to correctly folded, unentangled, native states. The simulations reported here also
show that, of the two nascent chains that have been identified as experimentally feasible [22],
that in the form of a helix is to be preferred to a fully extended conformation. Furthermore,
the results here provide a glimpse of the many entangled states that can be expected to arise
if protein folding is just a random search for the native structure, driven by thermal noise,
especially if the initial structure is also arbitrary. Indeed, only 30 % and 40 % of the direct
pathways to the native state, from an α-helical and a fully extended initial conformation,
respectively, lead to a correctly folded 1IGD protein, even in the presence of the artificial
TMD term (1). On the other hand, the simulations here also show how a well-defined
intermediate/transient structure in the folding pathway can dramatically increase the odds
of reaching the native structure in a reproducible manner: with the roughly built embrio
tentatively considered here (see figure 4) the probability of reaching the native structure
became 85 %! The really important point, though, is the new direction in which to look for a
solution to the protein folding problem that is opened up by these and other results:
instead of a thermodynamic process, a kinetic process; instead of arbitrary initial structures,
a well-defined and already ordered nascent chain; instead of many pathways, or even preferred
pathways, a specific pathway for folding. Another important point is the concrete program
of research that naturally follows from it [14]: 1) the identification of the key amino acids,
or amino acid sequences, at which the bends of the initial α-helix occur, can be made by a
statistical analysis of the known structures of proteins formed by helical bundles only; 2) the
same structures can be mined to find the relative orientations of the helical pieces in their
respective embrios, and 3) once the key amino acids and these initial orientations have been
identified, a statistical analysis of the known structures of α/β proteins may tell us which
amino acid sequences in these initial helical pieces generate repulsive side chain interactions
that, in turn, destabilize those helices and lead to the formation of β-sheets. It is indeed
hoped that the results here will inspire further work to explore this new direction.
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* Most of these simulations were performed at the Milipeia cluster of the Laboratory for Ad- 
vanced Computing of the University of Coimbra, Portugal. 
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CAPTION LIST 



1. (color online) Average total potential energy (kcal/mol) over 20 independent TMD
simulations, together with the corresponding standard deviations, as a function of
the RMSD distance to the native structure of the mainly α 1BDD protein [24]. The
starting conformation is fully extended (α-helical) for the green (red) curves. The
inset shows the final stages of convergence to the native structure.

2. (color online) The native fold of the α/β 1IGD protein is depicted on the left and on
the right the red circles highlight two of the most common backbone entanglements
that arose in the TMD simulations of this protein. The figures were made with the
VMD software [30].



3. (color online) Average total potential energy (kcal/mol) over non-entangled TMD
simulations, together with the corresponding standard deviations, as a function of the
RMSD distance to the native structure of the 1IGD protein [25]. The starting
conformation is fully extended (α-helical) for the green (red) curves and the embrio (see
text and figure 4) for the blue curves. The inset shows the final stages of convergence
to the native structure.

4. (color online) The putative embrio (see text) built from the native fold of the α/β
1IGD protein by substituting the native β-sheets by α-helices. This figure was made
with the software VMD [30].
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FIG. 3. 
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